feat(embeddings): graph-enriched strategy + context window overflow detection by carlos-alm · Pull Request #74 · optave/ops-codegraph-tool

carlos-alm · 2026-02-24T02:40:43Z

Summary

Graph-enriched structured embedding strategy (new default): embeds callers/callees from the dependency graph, extracted parameters, and leading comments instead of raw source code — producing ~100 tokens per symbol vs ~360 avg, with dramatically better semantic signal for natural language search queries
Context window overflow detection (Embeddings: no handling when symbol source exceeds model context window #55): every model in MODELS now has a contextWindow field; buildEmbeddings estimates token count, truncates oversized texts, warns the user, and records truncated_count in embedding_meta
--strategy CLI flag: codegraph embed --strategy structured|source lets users choose between the new graph-enriched approach and the original raw-code strategy
Structured strategy largely eliminates context window overflow (Embeddings: no handling when symbol source exceeds model context window #55) since compact text fits any model, but the safety net is still in place for the source strategy and edge cases

Closes #55, closes #56

Test plan

New tests/search/embedding-strategy.test.js (15 tests): structured text includes graph context (Calls/Called by/Parameters/leading comments), source text uses raw code, overflow detection warns + truncates, strategy stored in metadata, defaults to structured
Existing tests/search/embedder-search.test.js (16 tests): no regressions in search/RRF functionality
Lint clean (Biome)

greptile-apps · 2026-02-24T02:43:15Z

Greptile Summary

Adds a graph-enriched embedding strategy that uses dependency graph context (callers/callees/parameters/comments) instead of raw source code, dramatically reducing token usage from ~360 to ~100 tokens per symbol while improving semantic search quality. Introduces context window overflow detection that estimates tokens, truncates oversized text, and records truncation metadata.

Major changes:

New structured strategy (now default) embeds callers/callees from graph, parameters from signatures, and leading comments
--strategy CLI flag allows choosing between structured and source strategies
All models in MODELS now have contextWindow field for overflow detection
estimateTokens() function provides rough token count (~4 chars per token)
extractLeadingComment() extracts JSDoc/inline comments for better semantic signal
buildEmbeddings() warns and truncates when text exceeds model's context window
Metadata tracks strategy and truncation count in embedding_meta table
15 comprehensive tests validate both strategies, overflow handling, and metadata storage

Confidence Score: 5/5

This PR is safe to merge with minimal risk
Well-designed implementation with comprehensive tests, clear separation of concerns, backward compatibility (defaults to new strategy), and robust error handling. The structured strategy is a significant quality improvement that's well-validated by 15 passing tests.
No files require special attention

Important Files Changed

Filename	Overview
src/embedder.js	Adds graph-enriched embedding strategy, context window overflow detection, and token estimation - well-structured implementation with comprehensive logic
src/cli.js	Adds --strategy flag to embed command, updates models display to show context windows, validates strategy input
src/index.js	Exports new EMBEDDING_STRATEGIES and estimateTokens functions for programmatic API
tests/search/embedding-strategy.test.js	Comprehensive test suite (15 tests) validating structured vs source strategies, overflow detection, metadata storage, and graph context extraction

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[buildEmbeddings] --> B{Strategy?}
    B -->|structured| C[Prepare graph queries<br/>calleesStmt, callersStmt]
    B -->|source| D[Skip graph prep]
    
    C --> E[Loop through nodes by file]
    D --> E
    
    E --> F{Strategy?}
    F -->|structured| G[buildStructuredText<br/>Extract params, graph context, comments]
    F -->|source| H[buildSourceText<br/>Extract raw source lines]
    
    G --> I[estimateTokens]
    H --> I
    
    I --> J{Tokens > contextWindow?}
    J -->|Yes| K[Truncate text<br/>Increment overflowCount]
    J -->|No| L[Keep text as-is]
    
    K --> M[Add to texts array]
    L --> M
    
    M --> N{More nodes?}
    N -->|Yes| E
    N -->|No| O{overflowCount > 0?}
    
    O -->|Yes| P[warn user about truncation]
    O -->|No| Q[Continue]
    
    P --> R[embed texts]
    Q --> R
    
    R --> S[Store embeddings + metadata<br/>including strategy and truncated_count]

_{Last reviewed commit: d5538b7}

greptile-apps

_{4 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

…etection Addresses #55 and #56 with a combined solution: - Add contextWindow to MODELS registry for all 7 embedding models - New 'structured' embedding strategy (default): uses callers/callees from the dependency graph, extracted parameters, and leading comments instead of raw source code (~100 tokens vs ~360 avg) - Detect and truncate symbols exceeding model context window, with user-facing warning and truncated_count in embedding_meta - Add --strategy flag to CLI embed command (structured|source) - Store strategy in embedding_meta for auditability - Export estimateTokens() and EMBEDDING_STRATEGIES for programmatic use Closes #55, closes #56 Impact: 5 functions changed, 3 affected

Standalone script that compares structured vs source strategies against 14 real search queries on the project's own graph. Reports Hit@1/3/5 and per-query rank comparison. Usage: node tests/search/embedding-benchmark.js [--model minilm] Impact: 1 functions changed, 0 affected

claude · 2026-02-24T02:57:05Z

Claude finished @carlos-alm's task —— View job

PR Review Complete ✅

Conducting comprehensive review of the graph-enriched embedding strategy implementation.

Review Checklist

Phase 1: Root cause analysis - understand problem being solved
Phase 2: Configuration change detection - check for rule relaxation
Phase 3: Test integrity - verify no weakened assertions
Phase 4: Code quality review - bugs, security, performance
Final Assessment: Rating and recommendation

Phase 1: Root Cause Analysis ✅

Problems Being Solved:

Context window overflow (Embeddings: no handling when symbol source exceeds model context window #55): Original embedding strategy was hitting token limits (~360 tokens/symbol average), causing truncation or failures especially with smaller models like minilm (256 token context window)
Poor semantic signal: Raw source code embeddings were verbose and noisy, containing implementation details rather than semantic meaning
Lack of graph context: Original strategy ignored the dependency relationships that are core to codegraph's value proposition

Why This Approach is Justified:

Graph-enriched strategy: Leverages codegraph's core strength (dependency graph) by embedding caller/callee relationships, parameters, and leading comments instead of raw code
Dramatic token reduction: ~100 tokens vs ~360 average - directly addresses overflow issues
Better semantic signal: Comments and function signatures provide clearer intent than implementation code
Backward compatibility: Maintains source strategy as option via --strategy flag
Conservative overflow handling: Still includes detection/truncation as safety net

This is solving real, documented issues with a well-reasoned architectural approach.

Phase 2: Configuration Change Detection ✅

No concerning configuration changes found:

No biome.json, eslint, or other linting configuration changes
No test timeout increases or coverage threshold reductions
No TypeScript strict mode flags disabled
No lint-disable comments or @ts-ignore added
Only files changed are implementation (src/) and comprehensive new tests

This is clean implementation without any rules relaxation.

Phase 3: Test Integrity ✅

Excellent test coverage with 15 comprehensive tests:

Tests cover both structured and source strategies with clear behavioral expectations
Proper mocking: Uses vi.hoisted() and proper spy setup/restoration for console warnings
Real assertions: No weakened expectations - tests verify exact graph context presence (toContain('Calls:'), toContain('multiply'))
Edge case coverage: Context window overflow detection with both strategies
Metadata validation: Tests verify strategy and truncation count stored in DB
Token estimation tests: Validates the ~4 chars/token heuristic

No red flags found:

No test.skip or test.only
No conditional assertions or relaxed expectations
Appropriate use of mocks without hiding real behavior
Tests are testing actual functionality, not implementation details

Phase 4: Code Quality Review ✅

Security: ✅ Clean

All SQL operations use parameterized queries (db.prepare() with ? placeholders)
No dynamic SQL construction or injection vulnerabilities
No eval(), new Function(), or code execution from user input
Buffer operations are safe (storing embedding vectors)

Error Handling: ✅ Robust

Graceful degradation when files can't be read (warn() + continue)
Proper model loading with clear error messages for auth/network issues
Dimension mismatch detection between query and stored embeddings
Context window overflow detection with user warnings

Performance: ✅ Well-optimized

Efficient batch processing with model-specific batch sizes (4-32 items)
File reading batched by file to minimize I/O
Graph queries prepared once and reused
SQLite transaction for bulk inserts
Appropriate indexing (node_id PRIMARY KEY, foreign key constraints)

API Design: ✅ Solid

Backward compatibility maintained (strategy defaults to 'structured')
Clear separation of concerns (buildStructuredText vs buildSourceText)
Consistent parameter validation and error messages
Metadata tracking for debugging (strategy, truncation count, build timestamp)

Code Quality: ✅ High

Well-documented with JSDoc and inline comments
Consistent naming conventions and code style
Good separation of pure functions (estimateTokens, extractLeadingComment)
No code duplication or unnecessary complexity

Root Cause Analysis

Problem Being Solved: Context window overflow and poor semantic signal from raw code embeddings. The original strategy was creating ~360 token embeddings that exceeded model limits and provided noisy semantic signal.

Why This Approach: Graph-enriched structured strategy leverages the dependency graph (codegraph's core value) to create compact (~100 token), semantically rich embeddings from caller/callee relationships, parameters, and comments instead of raw implementation code.

Risk Assessment: Very low risk. Backward compatible, comprehensive tests, no configuration relaxation, proper error handling.

Critical Concerns

None. This is exemplary code with excellent engineering practices.

Final Recommendation

Rating: ⭐⭐⭐⭐⭐ (5/5)
Action: APPROVE
Reasoning: This PR demonstrates exceptional engineering quality - it solves real documented problems with a well-architected solution, includes comprehensive test coverage (15 tests), maintains backward compatibility, has robust error handling, and introduces no security or performance concerns. The graph-enriched approach is a natural evolution that leverages codegraph's core strengths while dramatically improving token efficiency. This is exactly the type of thoughtful, well-tested feature enhancement that should be merged without hesitation.

greptile-apps Bot reviewed Feb 24, 2026

View reviewed changes

github-actions Bot added 2 commits February 23, 2026 19:56

carlos-alm force-pushed the fix/embedding-strategy branch from 76d0c6f to 56a0517 Compare February 24, 2026 02:56

carlos-alm merged commit 44e8146 into main Feb 24, 2026
16 checks passed

carlos-alm deleted the fix/embedding-strategy branch February 24, 2026 03:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(embeddings): graph-enriched strategy + context window overflow detection#74

feat(embeddings): graph-enriched strategy + context window overflow detection#74
carlos-alm merged 2 commits into
mainfrom
fix/embedding-strategy

carlos-alm commented Feb 24, 2026

Uh oh!

greptile-apps Bot commented Feb 24, 2026

Uh oh!

greptile-apps Bot left a comment

Uh oh!

claude Bot commented Feb 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented Feb 24, 2026

Summary

Test plan

Uh oh!

greptile-apps Bot commented Feb 24, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete ✅

Review Checklist

Phase 1: Root Cause Analysis ✅

Phase 2: Configuration Change Detection ✅

Phase 3: Test Integrity ✅

Phase 4: Code Quality Review ✅

Root Cause Analysis

Critical Concerns

Final Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Feb 24, 2026 •

edited

Loading